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Abstract 

Inferring phylogenetic relationships between closely related taxa can be hindered 
by three factors: (1) the lack of informative molecular variation at short evolu- 
tionary timescale; (2) the lack of established markers in poorly studied taxa; and 
(3) the potential phylogenetic conflicts among different genomic regions due to 
incomplete lineage sorting or introgression. In this context, Restriction site Asso- 
ciated DNA sequencing (RAD-seq) seems promising as this technique can gener- 
ate sequence data from numerous DNA fragments scattered throughout the 
genome, from a large number of samples, and without preliminary knowledge on 
the taxa under study. However, divergence beyond the within-species level will 
necessarily reduce the number of conserved and non-duplicated restriction sites, 
and therefore the number of loci usable for phylogenetic inference. Here, we 
assess the suitability of RAD-seq for phylogeny using a simulated experiment on 
the 12 Drosophila genomes, with divergence times ranging from 5 to 63 million 
years. These simulations show that RAD-seq allows the recovery of the known 
Drosophila phylogeny with strong statistical support, even for relatively ancient 
nodes. Notably, this conclusion is robust to the potentially confounding effects of 
sequencing errors, heterozygosity, and low coverage. We further show that clus- 
tering RAD-seq data using the BLASTN and SiLiX programs significantly 
improves the recovery of orthologous RAD loci compared with previously pro- 
posed approaches, especially for distantly related species. This study therefore val- 
idates the view that RAD sequencing is a powerful tool for phylogenetic inference. 



Introduction 

Resolution of phylogenies between closely related species 
can be problematic for a number of reasons. First, due to 
incomplete lineage sorting and introgression, different loci 
might trace different evolutionary histories. In addition, 
most nuclear markers lack resolution at short evolution- 
ary scales. And finally, in poorly studied taxa, molecular 
markers might not have been developed. In this context, 
Restriction site Associated DNA sequencing (RAD-seq) 
appears as a promising approach. This technique relies on 
the high throughput sequencing of genomic regions flank- 
ing restriction sites (Baird et al. 2008; Davey et al. 2010; 
McCormack et al. 2013; Rowe et al. 2011). It thus gener- 
ates numerous homologous markers, scattered throughout 
genomes, potentially from hundreds of specimens in a 
single sequencing run. Moreover, if the precise number of 
loci is not critical, this approach is potentially universal, 



as it does not require preliminary knowledge on the taxa 
under study (Baxter et al. 2011). 

The RAD-seq technique was initially designed to gener- 
ate informative molecular variation within species, and 
has repeatedly proved its efficiency for this purpose (e.g., 
Emerson et al. 2010; Hohenlohe et al. 2010). However, 
the utilization of RAD-seq to compare genomes from dif- 
ferent species can potentially be hindered by a number of 
caveats. Specifically, the following conditions must be ful- 
filled for RAD-seq to be suitable for phylogenetic infer- 
ence: (1) enough restriction sites must be conserved 
between species; (2) the flanking regions must be suffi- 
ciently conserved for homology to be detectable by 
sequence similarity; and (3) the resulting alignments must 
contain enough phylogenetic signal. 

Recent studies, published as this article was in prepara- 
tion, concur to suggest that these conditions are often ful- 
filled (Rubin et al. 2012; Wagner et al. 2012). Here, we 
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use a simulated RAD-seq experiment on the 12 Drosoph- 
ila genomes (Clark et al. 2007) to assess if this is the case 
at various degrees of molecular divergence, and to opti- 
mize the procedure of RAD data analysis for phylogeny. 
Consistently with Rubin et al. (2012), we were able to 
recover and align enough sequences from orthologous 
loci to reconstruct the known (whole-genome-based) phy- 
logeny (Clark et al. 2007), with good statistical support. 
We further show here that sequence clustering based on 
the BLASTN and SiLiX programs significantly improves 
the recovery of orthologous RAD loci. This study there- 
fore validates and reinforces RAD-seq as a powerful tool 
for phylogenetic inference. 

Divergence time and RAD loci 
conservation 

The number of RAD loci potentially usable to compare 
two specimens is the number of restriction sites conserved 
in their genomes, which is expected to decrease as diver- 
gence time increases. We used the 12 Drosophila genomes 
to establish the relationship between genome divergence 
and restriction sites conservation. Complete genomes and 
pairwise alignments of the D. melanogaster genome with 
each of the 11 others were downloaded from http://gen- 
ome.ucsc.edu/. Conserved target sites of the restriction 
enzyme Sbfl were counted in each pairwise alignment. 
This enzyme is one of the most commonly used in RAD- 
seq studies because its 8 bp, GC-rich recognition site is 
rare enough in genomes to maximize sequencing coverage 
for each locus. 

Figure 1 shows the relationship between restriction site 
conservation and sequence divergence (measured at four- 
fold degenerated sites of codons), with divergence times 
ranging from 5.4 to 63 My (Tamura et al. 2004). The 
genome of D. melanogaster contains 2,948 sites, 49.1% 
and 50.5% of which are conserved in its two closest rela- 
tives D. simulans and D. sechellia. After 63 My of diver- 
gence, D. melanogaster, respectively, shares 4.9, 4.8, and 
5.1% of its restriction sites with D. grimshawi, D. mojav- 
ensis, and D. virilis, which represent 145, 142, and 149 
restriction sites. Notably, these estimates are conservative 
as the 2,948 restriction sites in D. melanogaster are 
counted from its complete genome, while conserved sites 
are only detected in the aligned genomic regions. 

How many RAD tags are usable for 
phylogeny? 

The above results suggest that more than 100 Sbfl cut 
sites are conserved between species having diverged for 
63 My (Fig. 1). However, for the genomic regions flank- 
ing these sites to be usable for phylogenetic inference, 
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Figure 1. Relationship between molecular divergence at fourfold 
degenerate sites {Drosophila consortium, 2007) and Sbfl restriction 
site conservation. Note: genomic regions aligned more than once 
against the D. melanogaster genome were excluded. Numbers next to 
crosses indicate divergence time. 

sequences must be sufficiently conserved for homology to 
be detected. Moreover, paralogous loci may confound 
phylogenetic analysis and should therefore be excluded. 

Illumina sequencing reactions yield 101 bp or 51 bp 
long reads. Here, we focus on the longest reads, which 
maximize the potential phylogenetic signal (see below for 
a discussion on this issue). Each read starts with a bar- 
code sequence identifying a sample (up to 8 bp long) and 
the 8 bp restriction site followed by 85 bp of usable data. 
We thus extracted 85 bp in 5' and 3' of each restriction 
site to set up the list of RAD sequences existing in each 
Drosophila genome. To identify sets of homologous 
sequences among the 12 species, we used a two-step pro- 
cedure: first, we performed all-against-all BLASTN com- 
parisons (Altschul et al. 1990); and second, we analyzed 
these results with the SiLiX software (Miele et al. 2011), 
to cluster sequences sharing a minimum level of sequence 
identity over a minimal length (see below for a discussion 
on the optimal parameter values). As can be seen in 
Table 1, the proportion of orthologous RAD tag pairs 
retrieved by BLASTN and SiLiX (i.e., gathered in the 
same cluster) decreases with divergence time from 99% 
(between D. melanogaster and D. simulans, 5.4 My of 
divergence) to 49% (D. melanogaster and D. wilistoni, 
63 My of divergence). 

Restriction sites located in repeated regions may 
be problematic for phylogenetic analysis and should 
therefore be identified and discarded. For every pair of 
species, we thus identified clusters containing more than 
one locus from at least one of the two species. The exclu- 
sion of these clusters leads to a loss of less than 10% of 
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Table 1. Number of known and retrieved orthologous RAD tags in 
each species pair. "Orthologous tags": total number of orthologous 
RAD tags present in pairwise alignments (D. melanogaster vs. each of 
the 1 1 other species). "Retrieved orthologous tags": proportion of or- 
thologous tags clustered by SiLiX. "In clusters including paralogs": 
proportion of the retrieved orthologous tags clustered in groups con- 
taining more than one locus. Loci are defined here based on genome 
sequences (see main text). Node depth from Tamura et al. (2004). 
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the RAD tags (Table 1). Among the clusters containing 
only single copy loci, we observed 2% of "false positives," 
that is, clusters containing loci that are not considered as 
orthologous based on whole-genome alignments. 

Phylogenetic inference from 
simulated RAD-seq data 

The above estimates rely on genome alignments for the 
identification of clusters containing paralogous RAD loci. 
In a true RAD-seq experiment, the "locus" definition 
would be solely based on an initial step of clustering of 
reads from each specimen. At this step, recently dupli- 
cated RAD tags may be mistakenly grouped into a single 
"locus" and too divergent alleles from the same locus can 
be mistakenly identified as two different loci. To estimate 
more realistically which proportion of the data would be 
usable for phylogenetic inference, we therefore simulated 
sequencing and "intra-specimen" clustering steps in our 
analysis. This simulation also allows us to test the impact 
of sequencing errors, heterozygosity, and coverage varia- 
tion. 

To assess the effect of heterozygosity on data analysis, a 
second haploid genome was simulated for each species 
using random mutations of the sequenced genome, to 
produce a 5% average distance between homologous 
alleles (the upper bound of realistic polymorphism val- 
ues). RAD sequencing was then simulated by randomly 
sampling reads from the list of all possible RAD loci 
from diploid genomes, with a 10 x mean coverage per 



locus, which was shown to cover 99% of all RAD loci at 
least once (see below). For each sampled read, sequencing 
errors were added with a uniform 1% error rate, that is, 
the upper bound of Illumina error rate estimates (Illu- 
mina technical support, Glenn 2011). Intra-individual 
clustering of the reads was performed using the ustacks 
program (Catchen et al. 2011). This program forms 
"stacks" (groups of strictly identical reads within individ- 
uals) and clusters similar stacks to form putative loci, 
with a user-defined maximum number of differences 
between stacks within a locus. Ustacks finally aggregates 
secondary reads (that were not initially placed in a stack) 
into existing stacks, to better estimate coverage. To maxi- 
mize the number of retrieved heterozygous loci, we 
allowed up to 13 differences between stacks within a 
locus. We observed that this high value did not increase 
the number of inferred loci containing paralogous reads 
(3% of the loci). We also allowed up to nine mismatches 
to cluster secondary reads to putative loci. Using such 
parameters, 11.6% of the loci were mistakenly split by 
ustacks. This shows that a high level of heterozygosity in 
the data does not hamper the clustering of reads within 
loci (only few reads from the same locus are mistakenly 
clustered into different loci). 

Orthologs were then searched with BLASTN and SiLiX 
as described above, using as input the consensus of each 
ustacks locus. We observe that 1.1% of the clusters that 
are known to contain paralogs (based on genome align- 
ments) are not identified as such in the simulated experi- 
ment, as they were mistakenly merged by ustacks in a 
single locus. These correspond to very recent duplication 
events that should not confound the phylogenetic signal, 
unless many of these duplications are anterior to the lat- 
est speciation event. 

For each cluster, sequences were aligned with Muscle 
v3.8.31 (Edgar 2004) using default parameters. We 
selected and concatenated the 2,275 topologically infor- 
mative alignments (i.e., containing sequences from at least 
four different species, Fig. 2), adding gap sequences to 
represent missing orthologs. The proportion of gaps in 
the alignment varied from 19.2% for D. simulans to 
94.9% for D. wilistoni (Table 2), which is expected con- 
sidering the topology of the phylogenetic tree (the proba- 
bility to recover at least three orthologs for a given locus 
is higher for species that have many close relatives in the 
set of sampled species). A maximum likelihood phylogeny 
was built using PhyML (Guindon et al. 2010). Bootstrap 
support was calculated with 100 replicates. The expected 
phylogenetic relationships among the 12 Drosophila spe- 
cies, established from whole-genome comparisons (Clark 
et al. 2007), were correctly recovered with strong boot- 
strap supports (Fig. 3). Because the number of RAD loci 
conserved between genomes decreases with divergence 
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Figure 2. Results of SiLiX clustering of RAD sequences from the 12 
Drosophila genomes. 



Table 2. Percentage of gaps for each species in the concatenated 
alignment, after exclusion of loci present in less than four species. 
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time, the number of loci that can be used to infer phylog- 
eny tends to decrease for the deepest branches of the tree 
(Fig. 3). However, there remains enough phylogenetic sig- 
nal to infer the correct topology with high confidence, 
even for the most ancient branches. 

Practical issues: sequencing coverage, 
number of specimens, and read 
length 

In a typical RAD-seq experiment, DNA samples from sev- 
eral individuals are tagged with molecular identifiers and 
multiplexed in the same flow cell lane. The average 
sequencing coverage per locus per individual is given by 
the following formula: 
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Figure 3. RAD-seq-based phylogeny of the 12 Drosophila species, 
based on 100-bp-long RAD-seq reads, inferred by maximum 
likelihood using PhyML 3.0. under a GTR+ G substitution model, 
using the concatenated alignments from orthologous-only clusters 
containing at least four sequences. Bootstrap values (100 replicates) 
were equal to 100% on every node. We indicate the number of 
informative loci at each node (shared by at least one species on each 
side of the bifurcation and at least one outgroup). 



coverage ■ 



E;:f2xs, 



where R is the total number of reads, N the number of 
individuals included in the library preparation, and S; the 
number of restriction sites in the genome of individual i. 
Increasing the pool size allows a decrease in sequencing 
cost per sample, but leads to a lower coverage per locus, 
which should affect the number of loci that can be recov- 
ered. The proportion of loci that was sequenced at least 
once in the in silico experiment follows a Poisson distri- 
bution: with 5x or higher coverage, more than 99% of 
the expected loci are sequenced at least once; this propor- 
tion drops to 95% with a 3x coverage. However, RAD 
loci represented by only one read are not identified as 
valid RAD loci by ustacks (Emerson et al. 2010; Morris 
et al. 2011). Using two as the minimum coverage to cre- 
ate a stack in ustacks parameters, the proportion of loci 
for which at least one allele was recovered after the intra- 
individual clustering step is 88.3% for a 10 x mean cover- 
age, which was sufficient in our simulations to recover the 
expected phylogeny. The fact that only 88% of the loci are 
recovered (although 99.95% are sequenced at least twice) 
is due to the high levels of polymorphism and sequencing 
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error used in our simulations. Although those values are 
conservative, we would recommend an increased coverage. 

Notably, a significant proportion of restriction sites fall 
in recently duplicated regions, which reduces the number 
of RAD loci actually usable for phylogeny. For example, the 
reference sequence (haploid) of the D. melanogaster genome 
contains 2,948 restriction sites, that is, 5,896 potential RAD 
tags. Intra-individual clustering of reads from this genome 
using ustacks yields 4,296 clusters, 3% of which contain 
reads from more than one locus. This small proportion of 
clusters containing recently duplicated regions represents a 
substantial proportion of all reads (25.3%). Thus, 74.7% of 
reads actually fall in non-recently duplicated regions. 

The 12 Drosophila species analyzed here contain on aver- 
age 2,308 Sbfl restriction sites per genome. Sbfl is a rare- 
cutter enzyme because of its 8-bp-long and GC-rich restric- 
tion site. It is expected to yield among the lowest number of 
restriction sites per genome. This property is valuable for 
studies where a large number of specimens matters more 
than a large number of loci per genome. For example, with 
an Ulumina Hiseq 2000 flow cell lane (R = 1.5 x 10 8 reads), 
and for a targeted coverage of lOx, it would, in theory, be 
possible to pool DNA samples from up to 1000 specimens 
(for a genome size comparable to that of Drosophila). Of 
course, a number of issues would potentially lower the cov- 
erage for some loci or some specimens (e.g., genome size, 
GC content, DNA template concentration, and quality). 
Targeting a 50 x mean coverage would allow the analysis of 
200 specimens in one flow cell lane, and would be robust to 
a 10-fold variation in coverage between loci or specimens. 

With existing sequencing technologies, it is also possible 
to increase coverage or reduce sequencing cost by decreas- 
ing the length of the reads. For example, Illumina sequenc- 
ing runs can yield 101-bp- or 51-bp-long reads. We 
investigated whether 51 bp RAD-seq reads could be used 
for phylogenetic studies. We extracted 35 bp sequences on 
both side of each restriction sites and performed the in sil- 
ico RAD-seq experiment and phylogenetic analysis as 
described above (see supplementary material for details). 
Overall, the topology shows only one error (the position of 
D. wilistoni is incorrect). Recent nodes (younger than 
45 My) were resolved with high support within the Mela- 
nogaster group (D. melanogaster, D. simulans, D. sechellia, 
D. yakuba, D. erecta, and D. ananassae), but bootstrap val- 
ues were low (<80%) for some deep nodes. This poor sup- 
port for deep nodes suggests that short reads should only 
be used to resolve short-scale phylogenies. 

Orthology inference: SiLiX versus 
uclust 

As we were finishing this manuscript, Rubin et al. (2012) 
published a comparable analysis and also reached the 



conclusion that RAD-seq can be a useful tool for phyloge- 
netic inference. One notable difference between the two 
studies is the use of different procedures to identify or- 
thologous loci: uclust (Edgar 2010) versus SiLiX. 

In an attempt to optimize this crucial step, we con- 
ducted a comparison of the two methods using the same 
data set and various parameter values. We estimated the 
efficiency of orthology detection by calculating the pro- 
portion of orthologous RAD loci (known from the 
whole-genome alignment) that were correctly recovered 
(i.e., known orthologs gathered in the same cluster, with- 
out inclusion of any paralog). It should be noticed that 
this definition of "efficiency" reflects the combined effect 
of sensitivity (the ability to gather orthologous sequences) 
and specificity (the ability to exclude paralogs). To assess 
the effect of evolutionary distance on clustering efficiency, 
we computed this measure for clusters shared by at least 
4, 6, or 9 species (Fig. 4). 

The first step of the SiLiX procedure consists in compar- 
ing all sequences against each other with BLASTN. To 
avoid the detection of spurious sequence similarities, we 
allowed the filtering of low complexity sequences (parame- 
ter F = T, which is set by default in BLASTN) and we set 
the £-value parameter to 10~ 4 (i.e., stringent enough to 
avoid the detection of similarities between non-homolo- 
gous sequences). In the second step, SiLiX takes into 
account two parameters to cluster two sequences in a fam- 
ily: the fraction of their length covered by the BLASTN 
alignment (alignment overlap) and their sequence similar- 
ity (in the aligned region). The clustering of uclust is based 
on a global alignment, and requires one single parameter 
(the minimum percentage of identity over the entire 
sequence length). As expected, for both methods, the effi- 
ciency of orthology detection decreases with increasing evo- 
lutionary distance (compare in Fig. 4 the efficiency for 
RAD loci shared by 4, 6, or 9 taxa). As noted by Rubin 
et al. (2012), the efficiency of uclust decreases when the 
sequence similarity threshold is set to very high value 
( > 80%), and we observed the same trend for SiLiX 
(Fig. 4). For SiLiX, the efficiency also tends to decrease 
with increasing alignment overlap threshold. We there- 
fore recommend using relatively permissive thresholds 
(sequence identity > 35%, and sequence overlap > 35%). 
Interestingly, we observed that SiLiX was significantly more 
efficient than uclust, especially for clusters including some 
relatively distant homologs: for loci shared by at least nine 
species, SiLiX was two times more efficient than uclust 
(44.7% vs. 21.6% efficiency). In principle, the main interest 
of uclust is that it is extremely fast, because it does not 
require an exhaustive comparison of all sequences against 
each other's. However, given the number of sequences that 
have to be compared in a typical RAD-seq experiment (sev- 
eral thousands of loci for several hundreds of specimens), 
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Figure 4. Proportion of known orthologous sequences retrieved by SiLiX or uclust, for different clustering parameters. Results are shown for 
clusters containing at least 4, 6, or 9 species. Note the x-axes are not the same in all figures. For SiLiX, the values of two parameters are tested: 
the minimum overlap threshold (on the second figure, where the minimum identity parameter is set at 0.35), and the identity threshold (on the 
third figure, where the minimum overlap is set at 0.35). 



calculation time is not limiting. For example, in our analy- 
ses, the clustering of 55,324 sequences took 7 s with uclust 
versus 3 min for BLAST + SiLiX (on a single Macbook Pro 
with a 2.9 GHz processor and 8 Gb of memory). More- 
over, one disadvantage of uclust is that its result depends 
on the sequence that is used as a seed to initiate the cluster- 
ing. Hence, uclust clustering has to be replicated many times 
with different seeds to check results consistency. As the 
BLAST + SiLiX method is exhaustive, appears to be more 
efficient, and does not require to be replicated with different 
seeds, we argue that it should be preferred to uclust. 

Conclusion 

Our phylogenetic analysis using simulated RAD-seq data 
suggests that this method is suitable for interspecific 
comparisons, even for relatively large genetic divergences 
(1.3 substitutions per site, which corresponds to 63 My 
for the Drosophila clade). Enough non-duplicated restric- 
tion sites were conserved between species and sequence 
conservation between orthologous RAD tags was suffi- 
cient to detect homology for a large number of loci. 
Finally, the recovered alignments contained enough phy- 
logenetic information to yield strongly supported phylog- 
enies. It should be noted that in our simulations, we did 
not incorporate incomplete lineage sorting. However, 
empirical data suggest that RAD-seq produces data from 
enough loci to overcome this problem (Wagner et al. 
2013). 

This study further indicates that the SiLiX clustering 
program is more efficient than uclust to identify ortholo- 
gous RAD sequences, especially for distantly related spe- 
cies, for which this feature is most critical. In addition, 
we observed that RAD-seq-based phylogenies are robust 
to sequencing errors and high polymorphism values. In 
practice, we recommend to target a 50 x coverage, which 



is sufficient to sample 99% of all RAD alleles at least 
once, even with a 10-fold coverage variation between 
specimens, and to avoid the use of short reads (50 bp), 
which would lead to significant loss of phylogenetic signal 
for deep nodes. Overall, this study validates and rein- 
forces RAD-seq as a powerful tool for phylogenetic infer- 
ence. 

Methods 

We provide below program versions and parameter values 
used in this study, when not specified in the main text: 

Ustacks (Catchen et al. 2011) 

ustacks -t fasta -f file.fasta -r -m 2 -N 9 -M 13. 

BLASTN: blastall 2.2.25 (Altschul et al. 1990) 

blastall -i file.fasta -d file.fasta -p blastn -o file.blastn - 

W 11 -b 10000 -v 10000 -z 1000000 -e le-4 -m 8 

SiLiX (Miele et al. 2011) 

silix file.fasta file.blastn -r 0.35 -i 0.35 

Muscle V3.8.31 (Edgar 2004) 

Default parameters 

PhyML 3.0. (Guindon et al. 2010) 

phyml -i file -d nt -b 100 -m GTR. 
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Supporting Information 

Additional Supporting Information may be found in the 
online version of this article: 

Figure SI. Phylogeny of the 12 Drosophila species based 
on 50-bp-long RAD-seq reads, inferred by maximum like- 
lihood using PhyML 3.0. under a GTR + G substitution 
model, using the concatenated alignments from ortholo- 
gous-only clusters containing at least four sequences. 
Bootstrap values (100 replicates) are indicated in italics. 
Rad-seq data were simulated with 1% sequencing errors, 
but without polymorphism. Notice that bootstrap sup- 
ports for several deep nodes are low and that the place- 
ment of D. wilistoni is incorrect. Non-italic numbers 
indicate the number of informative loci for each node 
(shared by at least one species on each side of the bifurca- 
tion and at least one outgroup). 
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